
sequence in the current record. The “HI” tag shows the index of the query hit. The “AS” tag
shows the alignment score defined by the aligner. The “NM” tag shows the edit distance,
which is defined as the minimal number of single-nucleotide edits (substitutions, insertions,
and deletions) needed to transform the read sequence into the aligned segment of
the reference sequence.
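
As a minimal sketch of how these optional tags can be read from alignment records, the following shell function prints the “NM”, “AS”, and “HI” values for each record of a SAM stream. It relies on the SAM convention that optional fields start at column 12 and have the form TAG:TYPE:VALUE. The function name sam_tags and the sample record are made up for illustration and are not taken from SRR769545.

```shell
# Print the read name plus the NM, AS, and HI values of each alignment record.
# Optional fields start at column 12 and look like TAG:TYPE:VALUE.
# A tag that is absent from a record is reported as "NA".
sam_tags() {
  awk -F'\t' '
    /^@/ { next }                       # skip header lines
    {
      nm_v = "NA"; as_v = "NA"; hi_v = "NA"
      for (i = 12; i <= NF; i++) {
        split($i, f, ":")
        if (f[1] == "NM") nm_v = f[3]
        else if (f[1] == "AS") as_v = f[3]
        else if (f[1] == "HI") hi_v = f[3]
      }
      printf "%s\tNM=%s\tAS=%s\tHI=%s\n", $1, nm_v, as_v, hi_v
    }
  ' "$@"
}

# Hypothetical record with tab-separated fields, used only to show the call.
printf 'read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:1\tAS:i:8\tHI:i:1\n' | sam_tags
```

The function reads from standard input or from SAM files given as arguments. Not every aligner writes all three tags, which is why missing tags are reported as NA.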

For more details about the SAM format, see the Sequence Alignment/Map format
specification, which is available at “https://samtools.github.io/hts-specs/”.

2.3.2  Read Aligners

Several aligners are available for mapping reads to a reference genome. Here we discuss
only BWA [9], Bowtie [17], and STAR [8] as examples; other aligners are used in a
similar way.

By the time we reach the read mapping step, the raw reads in the FASTQ files should
already have been cleaned as discussed in Chapter 1. In the following sections, we show
how to map reads contained in FASTQ files to a reference genome. For this purpose, we
will download FASTQ files (run accession SRR769545) from the NCBI SRA database. To avoid
repeating the quality control steps, we assume that the FASTQ files have been cleaned
following the steps in Chapter 1 and are ready for mapping. Our example raw data
are paired-end reads from the 1000 Genomes whole-exome sequencing of an individual
from the British population. We can download the two FASTQ files (forward and
reverse) using the SRA Toolkit. The following commands create the directory “data”,
change into it, and download the two FASTQ files there:

mkdir data

cd data

fasterq-dump --verbose SRR769545
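
Because the reads are paired-end, the forward and reverse files should contain the same number of reads. A minimal sketch of such a check is given below. It assumes the usual four-lines-per-read FASTQ layout; the function names fastq_read_count and check_pairs are illustrative and are not part of the SRA Toolkit.

```shell
# Count the reads in an uncompressed FASTQ file (assumes 4 lines per read).
fastq_read_count() {
  awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Report both counts and succeed only if the two mates agree.
check_pairs() {
  n1=$(fastq_read_count "$1")
  n2=$(fastq_read_count "$2")
  echo "forward=$n1 reverse=$n2"
  [ "$n1" = "$n2" ]
}

# Usage once the download has finished:
# check_pairs SRR769545_1.fastq SRR769545_2.fastq
```

A mismatch between the two counts usually points to an interrupted download or a truncated file, which is easier to catch here than after a failed alignment.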

Each file is about 11 GB, so the two files take around 22 GB of storage space. To save
space, we can compress these files with the gzip utility, which reduces each file to about
2.6 GB. Most aligners accept gzipped FASTQ files.

gzip SRR769545_1.fastq

gzip SRR769545_2.fastq

The “.gz” extension is appended to the name of each file to indicate that the file was
compressed with gzip.
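
A compressed FASTQ file can still be inspected without writing the decompressed data to disk. The sketch below tests the integrity of a gzipped file with gzip -t and then prints its first reads. The function name peek_fastq_gz and the two-read demonstration file are made up for illustration.

```shell
# Print the first N reads (default 1) of a gzipped FASTQ file.
# gzip -t checks the archive first and fails if the file is corrupted.
peek_fastq_gz() {
  file=$1
  n=${2:-1}
  gzip -t "$file" || return 1
  gzip -cd "$file" | head -n $((n * 4))
}

# Self-contained demonstration on a made-up two-read file.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmpdir/demo.fastq"
gzip "$tmpdir/demo.fastq"              # replaces demo.fastq with demo.fastq.gz
peek_fastq_gz "$tmpdir/demo.fastq.gz"  # prints the four lines of read r1
rm -r "$tmpdir"
```

For the example data, the same call would be peek_fastq_gz SRR769545_1.fastq.gz. Note that the integrity test reads the whole file, so it takes some time on files of several gigabytes.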

We can also run FastQC and display the QC reports as follows:

fastqc SRR769545_1.fastq.gz SRR769545_2.fastq.gz

firefox SRR769545_1_fastqc.html SRR769545_2_fastqc.html

The per-base quality reports for the two FASTQ files are shown in Figure 2.17. The
reads in both files are of good quality.
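
To make concrete what a per-base quality report summarizes, the sketch below computes the mean Phred quality score at each read position directly from a FASTQ stream. It is a rough approximation only and not how FastQC produces its plots, which also show medians and quartiles. It assumes Phred+33 quality encoding and four lines per read; the function name mean_quality_per_position is illustrative.

```shell
# Print "position<TAB>mean Phred score" for every read position.
# Assumes Phred+33 encoding: score = ASCII code of the quality character - 33.
mean_quality_per_position() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 {                       # every 4th line holds the quality string
      n = length($0)
      for (i = 1; i <= n; i++) {
        sum[i] += ord[substr($0, i, 1)] - 33
        cnt[i]++
      }
      if (n > max) max = n
    }
    END { for (i = 1; i <= max; i++) printf "%d\t%.2f\n", i, sum[i] / cnt[i] }
  ' "$@"
}

# Usage on the compressed example data:
# gzip -cd SRR769545_1.fastq.gz | mean_quality_per_position | head
```

Mean scores that stay well above 20 to 30 across all positions are consistent with the good quality seen in the FastQC report.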